Extracting Attribute-Value Pairs from Product Specifications on the Web

Authors

  • Petar Petrovski
  • Christian Bizer
Abstract

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation. In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specification or not. In order to extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning to classify columns as attribute column or value column. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score up to 10% for extracting attribute-value pairs from tables and up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.
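
To make the extraction target concrete, the following is a minimal sketch of pulling attribute-value pairs out of an HTML specification table. It is not the authors' method: the supervised specification detector and column classifier described above are replaced here by a fixed assumption that the first column holds attribute names and the second holds values. The names `TableRowParser` and `extract_pairs` and the example table are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_pairs(html):
    """Return (attribute, value) pairs from rows with exactly two cells.

    Assumption (not from the paper): column 0 is the attribute column and
    column 1 is the value column. The paper learns this assignment instead.
    """
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [(r[0], r[1]) for r in parser.rows if len(r) == 2 and r[0] and r[1]]


# Hypothetical specification table; the last row spans both columns and is skipped.
example = """
<table>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Screen size</td><td>15.6 in</td></tr>
  <tr><td colspan="2">Shipping info</td></tr>
</table>
"""
pairs = extract_pairs(example)
```

A fixed two-column rule like this cannot handle specifications with more than two columns, which is one of the cases the abstract says the learned column classifier is meant to cover.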

Similar resources

DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present DEXTER, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our ...

Extracting and Using Attribute-Value Pairs from Product Descriptions on the Web

We describe an approach to extract attribute-value pairs from product descriptions in order to augment product databases by representing each product as a set of attribute-value pairs. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic entity. We formulate the extraction task as a classification prob...

IRIS: A Protégé Plug-in to Extract and Serialize Product Attribute Name-Value Pairs

This article introduces IRIS wrapper, which is developed as a Protégé plug-in, to solve an increasingly important problem: extracting information from the product descriptions provided by online sources and structuring this information so that is sharable among business entities, software agents and search engines. Extracted product information is presented in a GoodRelations-compliant ontology...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Automatically Extracting Web API Specifications from HTML Documentation

Web API specifications are machine-readable descriptions of APIs. These specifications, in combination with related tooling, simplify and support the consumption of APIs. However, despite the increased distribution of web APIs, specifications are rare and their creation and maintenance heavily relies on manual efforts by third parties. In this paper, we propose an automatic approach and an asso...


Publication date: 2017